Please check out our Shiny App: Happy Spotify for interactive visualizations.

Introduction

Spotify is an audio streaming platform that provides access to over 50 million tracks. As of October 2019, the platform had 248 million monthly active users, including 113 million paying subscribers. Spotify publishes a daily ranking of songs for each region, so it is interesting to see whether we can find regular patterns in how people listen to songs.

We want to explore the following four aspects: time, features, singers and lyrics.

Data sources

We used open-source data from Spotify and collected the features of each song using its API. We got the list of songs to focus on by scraping Spotify Charts, which lists the daily top 200 songs.

In addition, we got the connections among singers using the related-artists data from Spotify. Because of the large number of artists, we selected only the top 100 singers as our dataset and studied their network.

Since Spotify itself does not offer track lyrics, we scraped lyrics from Genius. Because we don’t know the track ID in Genius, we had to use the track name and artist name to match a song and get its lyrics. Songs that could not be properly matched were ignored.

Also, due to the size of the dataset (there are too many songs right now!), it would be impossible to carefully analyze every song on the list. We therefore focused on the top songs and the most popular singers.

Yichi Liu collected the daily ranking data of different countries, Rui Bai collected connections between artists, and Yuchen Pei collected lyrics of top songs.

Obstacle

Our raw data are quite messy. One song may have multiple track ids for different albums, and some albums even carry multiple labels. Hence, we need to locate the data by both Artist and Track.Name instead of by id alone. Moreover, since Spotify records songs from all over the world, track names and artist names contain Greek characters and characters from many other languages. This was problematic when we scraped other information based on the track name and artist name, and those characters would also produce meaningless word clouds in our later analysis. We therefore substituted English characters for the Greek characters and dropped the meaningless data.

Data Transformation

Get Features

The daily data scraped from Spotify Charts contain only the track name, track id and artists. We wanted more information about the songs, so for each song we retrieved its features by its track id.

Also, since many songs involve collaboration, artist_id may contain multiple ids. In those cases we extracted the id of the main artist.

Get Yearly Data

The existing charts like Billboard and playlists for the top songs in Spotify are for 2018. Since we are already at the end of 2019, we want to get the latest one. Thus, we generated the yearly ranking of the songs in different countries during a year by adding up the daily streams of each song and ordering them by their total streams.
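The aggregation described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the row layout and the field names (`artist`, `track`, `streams`) are hypothetical, the numbers are made up, and the original analysis may have been done with different tooling.

```python
from collections import defaultdict


def yearly_ranking(daily_rows):
    """Sum daily streams per (artist, track) and rank by total, highest first."""
    totals = defaultdict(int)
    for row in daily_rows:
        # Key on artist and track name rather than track id, because one
        # song can appear under several ids.
        totals[(row["artist"], row["track"])] += row["streams"]
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        {"rank": i, "artist": artist, "track": track, "streams": streams}
        for i, ((artist, track), streams) in enumerate(ranked, start=1)
    ]


# Toy rows with made-up numbers, not the real chart data.
daily_rows = [
    {"date": "2019-01-01", "artist": "A", "track": "x", "track_id": "id1", "streams": 100},
    {"date": "2019-01-02", "artist": "A", "track": "x", "track_id": "id2", "streams": 150},
    {"date": "2019-01-01", "artist": "B", "track": "y", "track_id": "id3", "streams": 200},
]
ranking = yearly_ranking(daily_rows)
```

Grouping by artist and track name rather than id means that the two ids of song "x" in the toy data are counted as one song.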

Get Lyrics

Because many track names and artist names contain non-English characters, we replaced them with the corresponding English characters when scraping lyrics from Genius, and then dropped any additional information in parentheses or after a horizontal bar.
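A minimal sketch of this kind of name cleaning is shown below. It is an assumption-laden illustration, not the exact procedure used in the project: the function name `clean_name` is hypothetical, the "horizontal bar" is interpreted as a " - " separator, and Unicode decomposition only strips accents (it does not transliterate Greek or other non-Latin scripts, which would need an extra mapping).

```python
import re
import unicodedata


def clean_name(name):
    """Return a simplified track or artist name for fuzzy matching."""
    # Decompose accented characters and drop the combining marks (ñ -> n).
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove anything in parentheses, e.g. "(with Justin Bieber)".
    no_parens = re.sub(r"\s*\([^)]*\)", "", stripped)
    # Keep only the part before a " - " separator, e.g. drop " - Remastered".
    main_part = no_parens.split(" - ")[0]
    # Collapse repeated whitespace and lowercase for comparison.
    return " ".join(main_part.split()).lower()
```

For example, `clean_name("Happy Xmas (War Is Over) - Remastered")` gives `"happy xmas"`, which is more likely to match a title on a lyrics site than the full chart title.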

Missing Values

For the daily data, there are six missing patterns in total, and most rows have no missing values. We noticed that some rows have no track name or artist_id, and we found the reason by looking at the website. For example, for rows with a missing track_id, we checked the original website and found that it indeed showed no song information, so their features could not be merged either. Some other rows have a missing track name because of a website error. We dropped those rows.

Data Overview

We used the global yearly ranking data (Top 100 from Nov.1 2018 to Oct.31 2019) to get an overview of the most popular songs and singers during this period.

Rank | Track | Artist | Streams
1 | Sunflower - Spider-Man: Into the Spider-Verse | Post Malone | 1066802067
2 | bad guy | Billie Eilish | 907019009
3 | thank u, next | Ariana Grande | 902085812
4 | 7 rings | Ariana Grande | 883900704
5 | Señorita | Shawn Mendes | 875370863
6 | Shallow | Lady Gaga | 788314764
7 | Without Me | Halsey | 738657161
8 | Happier | Marshmello | 731706076
9 | Wow. | Post Malone | 721656407
10 | I Don’t Care (with Justin Bieber) | Ed Sheeran | 710241659

Let’s consider top songs from another angle. If a song stayed on the ranking for a long time, it should also count as a popular song. We calculated the number of days that each song stayed in the Top 100 using the global daily dataset and drew a Cleveland dot plot. Since we only cared about “top” songs, we dropped songs that stayed on the ranking for fewer than 100 days.
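The day count can be sketched as follows. The field names (`date`, `artist`, `track`, `rank`) are hypothetical, and the thresholds are parameters so that the same helper works on small test data.

```python
from collections import Counter


def days_on_chart(daily_rows, top_n=100, min_days=100):
    """Count distinct days each song spent in the top_n and keep the long stayers."""
    days = Counter()
    seen = set()
    for row in daily_rows:
        key = (row["date"], row["artist"], row["track"])
        # Count each song at most once per day, even if it appears under
        # several track ids on the same chart.
        if row["rank"] <= top_n and key not in seen:
            seen.add(key)
            days[(row["artist"], row["track"])] += 1
    # Drop songs that stayed for fewer than min_days days.
    return {song: n for song, n in days.items() if n >= min_days}
```

Sorting the returned dictionary by value gives the ordering needed for a dot plot.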

There were 23 songs that stayed in Global Top 100 for the last whole year. Although few of them had a high ranking, they are still enduring and popular.

Among the 17 top singers, the most popular one was Post Malone, who had 6 songs in Top 100, while Billie Eilish, Ed Sheeran and XXXTENTACION each had 4 songs.

It is interesting to find that although Ariana Grande had two hugely popular songs, she did not have the most songs on the ranking. Therefore, when we judge whether an artist is hot, we should consider not only how popular their works are but also how many popular works they have.

For further investigation, we picked the top 4 singers: Post Malone, Billie Eilish, Ed Sheeran and XXXTENTACION.

Similarly, we studied the total number of days of each artist on the daily ranking last year. 25 singers stayed in the ranking for the whole year, including those 4 top singers we discussed above. Now it should be safer to draw the conclusion that these people are indeed the most popular singers worldwide (based on Spotify’s data).

Data Understanding

Before we start to analyze our data, we first need to understand the meaning of each audio feature. Explanations are from Spotify API.

Results

Streams Trend

To get an overview of the data, it is important to find out how many songs people listen to every day. Since we could not get the actual daily total streams across all songs, we computed the total streams of the Top 200 songs each day. The top songs are representative of what people listened to and make up the major part of the actual streams, so this is a reasonable approximation.

As shown in the graph above, there is a clear cyclical trend. By hovering the mouse over the line, we can see that the local peaks are always 7 days apart, which suggests a weekly trend. We therefore faceted by day of the week and found that the average total streams are much higher on Friday than on Sunday. Also, the average rises during the weekdays and drops on weekends. This suggests that people prefer to listen to music during weekdays, especially on Friday, while their interest in music is lower on weekends.

This looks surprising at first glance, since we would expect people to enjoy themselves on weekends, for example by listening to music. One possible reason is that people have many choices for relaxation during weekends. They can use their entire spare time to go camping, watch movies and spend time with family and friends. Although people’s desire for relaxation remains the same on weekdays, they only have fragmented spare time, and listening to music seems to be the best way to relax. Also, people enjoy Happy-Friday-Nights, so there is a clear peak on Friday.

Furthermore, although the graph shows no seasonal preference for music, the total streams on Dec 24th, which is Christmas Eve, are far higher than on other days. This special pattern does not appear for any other festival. Why is Christmas Eve an exception? Many famous Christmas songs come to mind. Christmas is a festival of songs! To check our hypothesis, we took a look at the top songs on Christmas Eve.
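The weekday comparison can be sketched as a simple group-and-average step. This is an illustration only: the input is assumed to be a mapping from `datetime.date` to the Top 200 total for that day, and the numbers in the usage example are made up.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_streams_by_weekday(daily_totals):
    """Average the daily total streams separately for each day of the week."""
    sums = defaultdict(int)
    counts = defaultdict(int)
    for day, total in daily_totals.items():
        name = WEEKDAYS[day.weekday()]  # date.weekday(): Monday is 0
        sums[name] += total
        counts[name] += 1
    return {name: sums[name] / counts[name] for name in WEEKDAYS if counts[name]}


# Made-up totals: two Fridays and one Sunday in January 2019.
example = mean_streams_by_weekday({
    date(2019, 1, 4): 300,
    date(2019, 1, 11): 500,
    date(2019, 1, 6): 200,
})
```

Comparing the resulting averages for Friday and Sunday is the kind of check that the faceted plot performs visually.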

Rank | Track | Artist | Streams
1 | All I Want for Christmas Is You | Mariah Carey | 10819009
2 | Last Christmas | Wham! | 9098668
3 | Santa Tell Me | Ariana Grande | 7086794
4 | It’s Beginning to Look a Lot like Christmas | Michael Bublé | 6877219
5 | Jingle Bell Rock | Bobby Helms | 6040533
6 | It’s the Most Wonderful Time of the Year | Andy Williams | 5960727
7 | Rockin’ Around The Christmas Tree | Brenda Lee | 5768868
8 | Happy Xmas (War Is Over) - Remastered | John Lennon | 5692945
9 | Do They Know It’s Christmas? - 1984 Version | Band Aid | 5497071
10 | Wonderful Christmastime [Edited Version] - Remastered 2011 / Edited Version | Paul McCartney | 5040731

Most of the top songs are indeed songs for Christmas. Christmas could remind people of those songs, which contributes to the high streams.

Popularity trend

Apart from the overall trend, is there a typical popularity trend once a song enters the chart? We drew a line chart of streams against the number of days since the song entered the chart. By clustering the songs by trend, we concluded that there are 3 types of popular songs, and we drew the trend for one typical song in each group.

  • Type 1: falling. These songs were listened to the most when they had just entered the chart, and their ranking fell as time went by. The popularity of these songs is usually related to the reputation of the artist.

  • Type 2: rising before falling. These songs ranked high on the chart for a long time before falling. This might be because the songs are of good quality. Regardless of the effect of the singers, people just loved the songs.

  • Type 3: rising. The total streams of these songs kept rising, and they moved to the top of the ranking and stayed there. Although they were not expected to be great songs, their melodies engaged people.

As concluded above, there is no single popularity trend for a song. Some songs are popular at the beginning and fall quickly, while others are preferred by more and more people over time. It’s all about the quality of the song. So don’t be sad if a song is not popular at once, and don’t be overconfident that a song will always be popular!

Feature Analysis

Distribution of features

In order to analyze track features, we first need to know the distribution of each feature, including danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo. We standardized the features so that they have the same scale to compare. The following boxplot shows their distribution.
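Standardization here is assumed to mean z-scoring, i.e. subtracting each feature's mean and dividing by its standard deviation; the report does not state the exact method. A minimal sketch with hypothetical helper names:

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a list of numbers: subtract the mean, divide by the std. dev."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant feature carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def standardize_features(tracks, feature_names):
    """Standardize each named feature across a list of track dictionaries."""
    columns = {
        name: standardize([track[name] for track in tracks])
        for name in feature_names
    }
    return [
        {name: columns[name][i] for name in feature_names}
        for i in range(len(tracks))
    ]
```

After this step every feature has mean 0 and unit variance, so features measured on different scales (for example tempo in beats per minute and loudness in decibels) can share one boxplot axis.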

It is clear that danceability, duration_ms, energy, loudness, tempo and valence were approximately evenly distributed around 0, which suggests that people did not show a special preference for these features. Acousticness, liveness and speechiness all had a median below 0, along with some outliers, indicating that people generally prefer songs with low values of these features, although this does not necessarily mean that songs with high values could not be popular.

For instrumentalness, we found that most of its values were below 0, while some were very large. Recall its definition: a predictor of whether a track contains no vocals. Our finding makes sense because most popular songs contain vocal content, resulting in low instrumentalness, and the outliers represent popular instrumental tracks.

Feature correlation

We were curious whether any pairs of these features are correlated, so we calculated their correlation matrix and drew a heatmap. Only loudness and energy show a strong positive correlation, with a coefficient of 0.763, while the other features seem to have no clear correlations.

Moreover, we are interested in the influence of features on the ranking of a track. From the heatmap, feature values show no clear correlation with yearly rank. This finding is useful: since listeners do not appear to have a preference regarding track features, artists don’t need to deliberately change their style to cater to public tastes.
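A correlation matrix of this kind can be computed with the Pearson coefficient. The report does not say which coefficient was used, so Pearson is an assumption here, and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for a constant sequence
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of feature name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}
```

Each cell of the heatmap corresponds to one entry `matrix[a][b]`; adding the yearly rank as one more column gives the feature-versus-rank row discussed in the text.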

Feature time series

We drew a line chart for each feature over the year, after rescaling each series so that it starts from 100.

The plot shows an obvious exception on Dec 25th, 2018. On that day, people preferred songs with high acousticness and loudness and low danceability, energy and speechiness. Since we noted earlier that loudness is positively correlated with energy, this result looks odd. As mentioned before, people loved Christmas songs at Christmas, for example “White Christmas” and “It’s Beginning to Look a Lot like Christmas”. We checked the features of the Christmas songs and found that they indeed have this unusual combination of features.
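Rescaling a series so that it starts from 100 amounts to dividing every value by the first value and multiplying by 100. A minimal sketch, with a hypothetical function name:

```python
def rebase_to_100(series):
    """Rescale a series so its first value becomes 100 and the rest scale with it."""
    if not series:
        return []
    first = series[0]
    if first == 0:
        raise ValueError("cannot rebase a series that starts at zero")
    return [value * 100 / first for value in series]
```

For example, `rebase_to_100([50, 55, 45])` gives `[100.0, 110.0, 90.0]`. Note that this simple ratio behaves differently for series with negative values (Spotify reports loudness in decibels, which are typically negative), so such a feature may need a shift or a different baseline before the lines are compared; how the original plot handled this is not stated.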

Features world map

The interactive choropleth plot is available in Shiny App. By selecting different features in the App, we find that the distribution of the features also varies among countries.

For example, comparing the distributions of danceability and liveness, we see that although people in Brazil do not prefer songs that are more suitable for dancing, they love live music.

Some more findings are in the Shiny App. For example, based on the distribution of valence and energy, South American listeners love positive and intense songs more than North American listeners do. Turkish listeners preferred instrumental music to spoken-word music.

In total, although features have no direct correlation with the ranking, different song features are preferred at different time points and in different countries.

Singers Deep Dive

Genre of a singer

After standardizing the features, we got the radar chart of every singer as follows.

It’s easy to find that all their songs had low instrumentalness except for Billie Eilish’s “Bad Guy” and “Bury a Friend”.

Post Malone’s songs had high loudness, energy and danceability. Unlike his other songs, “Wow” had relatively high speechiness since it is a rap song. Since “Sunflower” is a collaboration, if we ignore its influence, we can see that the valence of his songs is not very high.

Ed Sheeran’s songs had high loudness, energy and danceability and low tempo and liveness. Except for “Shape of You”, his top songs had low acousticness. The valence of “Perfect” is low because it is a romantic ballad written about his fiancée.

The radar chart of Billie Eilish is very interesting because the features of her songs vary widely. The four songs all had a tempo within the medium range. “Wish You Were Gay” is a song with extremely high danceability and liveness. “When the Party’s Over” is a song with extremely high acousticness and low liveness, energy and loudness. “Bury a Friend” had high acousticness, speechiness and instrumentalness. “Bad Guy” had high speechiness and relatively high valence. This indicates that Billie Eilish is not limited to one particular style.

For XXXTENTACION, the features of his songs were somewhat like those of Post Malone’s. His songs had relatively high loudness, energy and danceability. The tempo of his songs was not very high, but speechiness was high.

Connection of Singers in Top 100

After getting the related singers of each artist in the Top 100, we constructed a network of these artists. Similar singers are connected by an edge, where similarity is based on analysis of the Spotify community’s listening history.

We found that many singers do not share similarity with other top singers. For example, BTS is not connected to any other artist in this network, which means that people who like to listen to BTS’s songs do not tend to listen to songs from other singers in the top ranking. We have 15 such “lonely” singers, while the others form several clusters, shown in different colors in the network. Users who prefer Ed Sheeran’s songs may also like listening to Taylor Swift or James Arthur. Maluma, Ozuna, Lunay, Jhay Cortez and Dalex share relatively high similarity with one another. Another interesting finding is that some singers act as a “bridge” among multiple clusters. For example, Benny Blanco is the node that links three clusters together. People who listen to his songs also have a preference for singers in the linked clusters.
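The network construction can be sketched with a plain adjacency structure. This is a simplified illustration: the `related` mapping stands in for whatever the related-artists data returns, "Outside Artist" is a hypothetical name for someone outside the top list, and the toy edges only mirror the examples mentioned in the text.

```python
from collections import deque


def build_network(related, top_artists):
    """Undirected graph over top_artists; edges come from related-artist lists."""
    graph = {artist: set() for artist in top_artists}
    for artist, neighbours in related.items():
        if artist not in graph:
            continue
        for other in neighbours:
            # Keep only edges whose two ends are both in the top list.
            if other in graph and other != artist:
                graph[artist].add(other)
                graph[other].add(artist)
    return graph


def lonely_artists(graph):
    """Artists with no edge to any other top artist."""
    return sorted(a for a, neighbours in graph.items() if not neighbours)


def clusters(graph):
    """Connected components with at least two artists, found by breadth-first search."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen or not graph[start]:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            node = queue.popleft()
            component.append(node)
            for nxt in sorted(graph[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        result.append(sorted(component))
    return result


# Toy input for illustration only.
top = ["BTS", "Ed Sheeran", "James Arthur", "Taylor Swift"]
related = {
    "Ed Sheeran": ["Taylor Swift", "James Arthur", "Outside Artist"],
    "BTS": ["Outside Artist"],
}
network = build_network(related, top)
```

A "bridge" artist in this representation is a node whose removal would split one component into several, which could be checked by rerunning `clusters` on the graph without that node.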

Hey lyrics

Another interesting component of songs is the lyrics. We are interested in what different types of songs talk about, so we explored this topic by visualizing frequent words as word clouds.

We created an interactive word cloud (available in the Shiny App) in which users control which songs to visualize by selecting ranges of three song features: danceability, energy and loudness. Some other song features are not included because they are not relevant to the theme of the lyrics. For example, instrumentalness is not included since songs with different instrumentalness naturally differ in lyrics length rather than lyrics content.

By playing around with the sliders, we have some interesting findings. For example, if we change the danceability slider to high and low, we can get the two clouds respectively as below.

(Note: Only the first cloud is directly rendered by wordcloud2. Due to issues with this package (see SO post), the other 5 clouds were captured with webshot and rendered in PNG format.)

  • For songs with high danceability (above 0.816):

We can see that many of the frequent words (shown in larger sizes) are swear words, which differs markedly from songs with lower danceability (shown later). We suspect that the selected songs are mostly hip-hop songs, since they naturally have higher danceability. Thus, our hypothesis is that hip-hop songs are more likely to contain profanities than other songs, which we will test in later sections.

  • For songs with low danceability (below 0.556):

When we select songs with lower danceability, the frequent words in the cloud are gentler and somewhat story-telling, such as “heart” and “love”. One possibility is that many songs with lower danceability tend to be softer, with romantic themes.

Thus, we can see that songs with different feature ranges indeed have different content.
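The word counts behind such clouds can be sketched as a filter followed by a frequency count. This is a rough illustration under assumptions: the song dictionaries, the tokenizer and the stopword handling are hypothetical, and only the two danceability thresholds (0.816 and 0.556) come from the text.

```python
import re
from collections import Counter


def word_frequencies(songs, min_danceability=0.0, max_danceability=1.0,
                     stopwords=frozenset()):
    """Count words in the lyrics of songs whose danceability lies in a range."""
    counts = Counter()
    for song in songs:
        if min_danceability <= song["danceability"] <= max_danceability:
            # Very simple tokenizer: lowercase runs of letters and apostrophes.
            words = re.findall(r"[a-z']+", song["lyrics"].lower())
            counts.update(w for w in words if w not in stopwords)
    return counts


# Toy songs with made-up lyrics.
songs = [
    {"danceability": 0.90, "lyrics": "Dance dance all night"},
    {"danceability": 0.40, "lyrics": "Love, my heart, love"},
]
high = word_frequencies(songs, min_danceability=0.816)
low = word_frequencies(songs, max_danceability=0.556)
```

The resulting counters map each word to its frequency, which is the input a word cloud renderer typically expects; the same filter could be repeated for energy and loudness ranges.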

In order to further explore the differences in lyrics for different song themes, we have plotted the word clouds for four different playlists, including:

  • Christmas playlist

  • Halloween playlist

  • hip-hop playlist

  • Romance playlist

By simple comparison, we can see that different song themes indeed have different lyrics content. As expected, Christmas songs have frequent words such as “bells”, “santa” and “snow”. Similarly, for the Halloween playlist, we can see frequent words including “monster” and “night”, which are indeed horror-related.

Songs with a romantic theme tend to have soft and sweet words such as “love”, “baby”, “heart” and “kiss”, which is similar to the songs with lower danceability shown above. In contrast, the hip-hop songs are again full of bad language, much like the songs with high danceability. This supports our hypothesis that songs with higher danceability are likely to be hip-hop songs containing a large number of swear words.

Conclusion

In this project, we tried to find out the relationship between popular songs and time, features, singers and lyrics based on the daily data from Spotify. There are many interesting findings.

The total streams change over time. People listen to songs more on Friday than on Sunday, and on holidays like Christmas people especially love songs. Apart from the overall trend, there is no single popularity trend for popular songs. Looking more deeply at song features, we found no clear relationship between a song’s features and its ranking, so producers need not chase a trend in order to produce a popular song. However, the preference for features varies across time and countries. The popular singers each maintain a recognizable genre, and singers with similar styles are connected with each other. Finally, the word clouds of lyrics indicated that the content of songs differs when the features take different values.

All in all, this report gives readers more insights on the songs. We hope people could enjoy music more after reading this report! Hooray!